Machine learning algorithms using national registry data to predict loss to follow-up during tuberculosis treatment

Background Identifying patients at increased risk of loss to follow-up (LTFU) is key to developing strategies to optimize the clinical management of tuberculosis (TB). The use of national registry data in prediction models may be a useful tool to inform healthcare workers about risk of LTFU. Here we developed a score to predict the risk of LTFU during anti-TB treatment (ATT) in a nationwide cohort of cases using clinical data reported to the Brazilian Notifiable Disease Information System (SINAN).
Methods We performed a retrospective study of all TB cases reported to SINAN between 2015 and 2022, excluding children (< 18 years old), vulnerable groups, and drug-resistant TB. Only data available before treatment initiation were used for the score. We trained and internally validated three prediction scoring systems, based on Logistic Regression, Random Forest, and Light Gradient Boosting. Before fitting the models, we split the data into training (~ 80%) and test (~ 20%) sets, and then compared model metrics on the test set.
Results Of the 243,726 cases included, 41,373 experienced LTFU whereas 202,353 were successfully treated. The groups differed in several clinical and sociodemographic characteristics. Directly observed treatment (DOT) was unbalanced between the groups, with lower prevalence among those who were LTFU. Three models were developed to predict LTFU using 8 features (prior TB, drug use, alcohol use, tobacco use, age, sex, HIV infection, and schooling level) with different score composition approaches. These prediction scoring systems exhibited an area under the curve (AUC) ranging between 0.71 and 0.72. The Light Gradient Boosting technique gave the best prediction performance when weighing specificity and sensitivity. A user-friendly web calculator app was developed (https://tbprediction.herokuapp.com/) to facilitate implementation.
Conclusions Our nationwide risk score predicts the risk of LTFU during ATT in Brazilian adults prior to treatment commencement utilizing schooling level, sex, age, prior TB status, and substance use (drug, alcohol, and/or tobacco). This is a potential tool to assist in decision-making strategies to guide resource allocation, DOT indications, and improve TB treatment adherence. Supplementary Information The online version contains supplementary material available at 10.1186/s12889-024-18815-0.


Introduction
Despite the widespread availability of curative treatment for tuberculosis (TB), the disease remains a major plague of humanity, accounting for more than one million deaths annually [1]. Global treatment success is still below the targets established by the World Health Organization (WHO) [2,3], especially in low- and middle-income countries (LMIC) such as Brazil [4].
Current WHO treatment recommendations for drug-susceptible TB include six months of a combination of antibiotics [3]. Such long treatment is associated with an increased risk of loss to follow-up (LTFU) and may lead to adverse drug reactions [2]. Early identification, at the time of diagnosis, of patients at high risk of LTFU based on clinical and sociodemographic characteristics is key to providing personalized care, which may involve directly observed treatment (DOT), and to supporting decision-making strategies that mitigate losses in the cascade of care. Notably, the Brazilian Ministry of Health recommends DOT for all TB cases, but fewer than 50% of reported cases actually receive DOT. Reliable and accurate prediction tools [4] are therefore necessary, especially where limited resources require prioritization of intensive case management in settings with a high-middle TB disease burden.
Brazil is among the countries with the highest number of TB cases in the world, even though it follows the WHO's standardized TB treatment recommendations. Importantly, the cascade of care in Brazil for drug-sensitive TB comprises three steps: (1) mandatory reporting of TB cases to the Notifiable Diseases Information System (SINAN) [5,6]; (2) a six-month treatment regimen, usually in fixed-dose combination (FDC) [7]; and (3) reporting of treatment-associated outcomes in the SINAN database. SINAN is thus a significant source of data that could be explored to develop prediction models for LTFU during anti-TB treatment (ATT).
Therefore, we aimed to develop a web-based prediction model for LTFU among pulmonary TB treatment cases in Brazil at the baseline consultation, using secondary data elements readily available at diagnosis. Importantly, we developed a model that could be used by both the Brazilian government and clinicians as a readily available web-based tool for decision-making to achieve higher rates of TB treatment success.

Ethics statement
All data accessed in this study were obtained from a publicly available platform and pre-processed by the Brazilian Ministry of Health (https://datasus.saude.gov.br). This processing verified the data for consistency, duplicate registration, and completeness, following the instructions set by Resolution Number 466/12 on Research Ethics of the National Health Council, Brazil. There was no identifiable information in the databases, and thus the study was exempt from approval by ethics committees.

Study population
We performed a retrospective analysis of de-identified data from pulmonary TB cases reported to the Brazilian Notifiable Diseases Information System (SINAN).
SINAN is a centralized system for the notification of transmissible diseases, including TB. Data stored in SINAN are maintained by the Brazilian Ministry of Health, specifically by DATASUS (the Information Technology Department of the Brazilian Unified Health System), and can be accessed through a file transfer protocol [6].
We included all individuals aged 18 years or older who were notified in SINAN with pulmonary TB from 2015 to 2022. We excluded any patient who: (i) was diagnosed with TB postmortem; (ii) belonged to any special population (i.e. homeless people, people deprived of liberty, pregnant women, immigrants, and health workers); (iii) was resistant to any drug (rifampin, isoniazid, pyrazinamide, or ethambutol); (iv) had an outcome other than cure or LTFU; or (v) had pulmonary TB (PTB) together with ≥ 1 extrapulmonary TB (EPTB) site (Fig. 1). Vulnerable populations were removed because they present a different pattern of risk of illness and LTFU than the general population.

Variables definitions
Age was categorized using the following bins: children/teenage (0, 18], young adult (18, 35], adult (35, 50], senior adult (50, 65], and elderly (> 80 years old). The remaining variables were defined as follows. Biological sex: female or male. HIV infection: presence of an HIV diagnosis (self-reported). Alcohol consumption: ever use of alcohol. Tobacco use: ever smoking tobacco. Drug use: ever use of drugs (including marijuana, cocaine, heroin, or crack). Race: self-reported race/ethnicity, subdivided into Non-White (including "Yellow", "Black", "Pardo", which denotes mixed-race ancestry in Latin America [European, Indigenous, and African], and Indigenous) and White. DOT: implementation of directly observed therapy. Schooling: self-reported years of schooling. Abnormal chest X-ray: thoracic radiographic result indicative of TB. TB drug susceptibility test: susceptible to all first-line drugs or resistant to any drug. Smear grade: positive, negative, or not performed. Comorbidities such as diabetes and mental illness were classified according to their presence or absence at the time of TB diagnosis (self-reported). Prior TB: patient-reported history of TB treatment. This stratification followed the criteria adopted by the Brazilian Ministry of Health to report TB data [8].
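The right-closed age bins listed above can be expressed as a small lookup. The following is a minimal sketch and not the authors' code: the function name `categorize_age` and the parameter `elderly_from` are hypothetical. The text assigns the elderly label to ages above 80 and leaves the interval (65, 80] unassigned, so the default `elderly_from=65` is an assumption made only to keep the bins contiguous; passing `elderly_from=80` follows the text literally and returns `None` for the unassigned interval.

```python
from bisect import bisect_left

# Right-closed upper edges of the four bounded bins quoted in the text.
UPPER_EDGES = [18, 35, 50, 65]
LABELS = ["Children/teenage", "Young adult", "Adult", "Senior adult"]


def categorize_age(age, elderly_from=65):
    """Map an age in years to the age categories described in the text.

    `elderly_from` is an assumption (see the lead-in): 65 makes the bins
    contiguous, 80 reproduces the text literally.
    """
    if age <= 0:
        raise ValueError("age must be positive")
    if age <= UPPER_EDGES[-1]:
        # bisect_left yields right-closed intervals: 18 -> bin 0, 19 -> bin 1.
        return LABELS[bisect_left(UPPER_EDGES, age)]
    if age > elderly_from:
        return "Elderly"
    return None  # age falls in the interval the text leaves unassigned
```

Because the study excludes patients under 18, the first bin would only ever contain patients aged exactly 18 in the analysed cohort.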

Data analyses
We divided our data analysis into seven steps: (i) descriptive analysis, (ii) undersampling, (iii) data splitting, (iv) feature elimination, (v) hyperparameter tuning, (vi) model evaluation, and (vii) model building.
For the descriptive analysis, we used the median and interquartile range (IQR) to describe continuous variables and absolute and relative frequencies to describe categorical variables. Because our data were imbalanced (i.e. ~3 cures for 1 LTFU), we undersampled the most frequent class [9]. The resulting data set therefore had equal proportions of each outcome (i.e. 1 cure for 1 LTFU), and we then split it into training and test sets [10]. The training set comprised 70% of the data, whereas 30% was kept for model evaluation. To reduce data dimensionality, we used Recursive Feature Elimination with Cross-Validation (RFECV) [11]. We selected Random Forest (RF) as the estimator, used it in a 10-fold stratified cross-validation, and then selected the minimum number of variables that led to the highest model accuracy following the elbow rule. To find the best set of hyperparameters, we used a grid search: for each model (i.e. Logistic Regression, Random Forest, and Light Gradient Boosting [12,13]) we created a grid of parameters and evaluated the best combination on the training set. To select the best algorithm, we applied each model with its best combination of parameters to the test set and evaluated AUC, accuracy, sensitivity, and specificity [14,15]. To understand feature importance and the contribution of each feature to each outcome at the global and local levels, we used Shapley values. The last step consisted of retraining the model using the whole data set [16,17]. All code is available at (https://github.com/rodriguesmsb/TBPrediction).
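The undersampling and 70/30 splitting steps can be illustrated with a short standard-library sketch. It is not the authors' pipeline (their code is in the repository cited above): the function names, the `outcome` field, the seed values, and the toy data are all hypothetical, and stratifying the split by outcome is an assumption, since the text does not say whether the split was stratified.

```python
import random
from collections import defaultdict


def undersample(rows, label_key, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


def stratified_split(rows, label_key, test_fraction=0.3, seed=0):
    """Split rows into (train, test), keeping class proportions equal in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for group in by_class.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy records with 3 cures per LTFU, mirroring the imbalance described in the text.
cases = [{"id": i, "outcome": "LTFU" if i % 4 == 0 else "cure"} for i in range(400)]
balanced = undersample(cases, "outcome")                      # 1 cure per LTFU
train, test = stratified_split(balanced, "outcome", test_fraction=0.3)
```

Undersampling before splitting, as described in the text, means the held-out test set is also balanced, so metrics measured on it reflect a 1:1 outcome ratio rather than the ratio seen in the registry.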

Comparing machine learning algorithms to predict LTFU
We began model development with 13 variables, of which 8 were selected as the most informative by our RFECV approach (Fig. 2): (i) schooling, (ii) sex, (iii) prior TB, (iv) HIV infection, (v) alcohol use, (vi) drug use, (vii) tobacco use, and (viii) age. To predict which patients were more likely to experience LTFU, we fitted three different models using these variables, each with its own optimal hyperparameters. The logistic regression model performed best with strong regularization (C = 0.01), which underscores the role of regularization strength in balancing model complexity and generalization. The RF model achieved its best performance with a maximum depth of 8, meaning that each decision tree is allowed to split down to eight levels, and an ensemble of 500 decision trees, meaning that the final prediction combines the output of 500 trees. Both settings, the depth of each tree and the total number of trees in the ensemble, mattered for predictive accuracy. For the Light Gradient Boosting model, optimal performance was achieved with trees of maximum depth 4, 500 decision trees (number of estimators), and a learning rate of 0.01, reflecting the interplay between tree complexity, ensemble size, and learning rate.
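The grid search behind these hyperparameter choices can be sketched generically as follows. This is an illustration under stated assumptions, not the authors' search: the parameter names follow scikit-learn and LightGBM conventions, only the best values reported above are taken from the text, the other candidate values in `GRIDS` are hypothetical, and `toy_score` is a stand-in for the cross-validated score that a real run would compute by fitting a model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grids. Only the reported optima come from the text:
# C=0.01; max_depth=8 with n_estimators=500; and max_depth=4,
# n_estimators=500, learning_rate=0.01.
GRIDS = {
    "logistic_regression": {"C": [0.01, 0.1, 1.0]},
    "random_forest": {"max_depth": [4, 8, 12], "n_estimators": [100, 500]},
    "light_gradient_boosting": {
        "max_depth": [4, 8],
        "n_estimators": [100, 500],
        "learning_rate": [0.01, 0.1],
    },
}


def toy_score(params):
    """Stand-in scorer (an assumption): a real run would fit and cross-validate a model."""
    return -abs(params["C"] - 0.01)


best_params, best_score = grid_search(GRIDS["logistic_regression"], toy_score)
```

An exhaustive grid grows multiplicatively with each added parameter (the boosting grid above already has 2 x 2 x 2 = 8 combinations), which is one reason to keep each candidate list short.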
The next phase consisted of evaluating the three models (using the parameters described above) on the test set. The classifiers presented similar results (Supplementary Table S1).
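The four test-set metrics used in this comparison (AUC, accuracy, sensitivity, and specificity) can be computed as in the sketch below. It assumes that LTFU is coded as 1 and cure as 0, and that predicted probabilities are dichotomized at 0.5; neither choice is stated in the text, and the labels and scores shown are made-up toy values rather than study data.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, y_score, threshold=0.5):
    """Return accuracy, sensitivity, specificity, and AUC for binary labels."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # share of LTFU cases flagged
        "specificity": tn / (tn + fp),   # share of cured cases not flagged
        "auc": roc_auc(y_true, y_score),
    }


# Toy labels (1 = LTFU, 0 = cure) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
metrics = classification_metrics(y_true, y_score)
```

Unlike the other three metrics, AUC does not depend on the threshold, which makes it convenient for comparing classifiers before an operating point has been chosen.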
According to our calibration plot, Light Gradient Boosting presented the best result, since its predicted probability of LTFU corresponded most closely to the observed frequency of the positive class (Supplementary Fig. S1). Random Forest presented the worst result: its predicted probabilities underestimated the real likelihood of the positive class. Thus, based on all these results, we decided to use Light Gradient Boosting to construct our predictive model (Fig. 3). We used SHAP values to attribute to each feature its contribution to a model prediction, offering insights into feature importance and interactions. Such values help interpret complex models by showing which factors drive specific predictions. According to our model, prior TB was the most important feature: a patient who had experienced prior TB had an increased likelihood of LTFU. Another important feature was drug use: patients who reported using drugs had an increased probability of LTFU during ATT (Fig. 4).
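To make the idea of a per-feature contribution concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every feature subset. It is a simplified illustration and not the SHAP library or the fitted model: absent features are replaced by one reference patient, whereas SHAP typically averages over a background data set, and `toy_risk`, its coefficients, and the feature names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of predict(instance) relative to predict(baseline)."""
    features = list(instance)
    n = len(features)

    def value(subset):
        # Features in `subset` take the instance's value; the rest keep the baseline.
        mixed = {f: (instance[f] if f in subset else baseline[f]) for f in features}
        return predict(mixed)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {f}) - value(members))
        phi[f] = total
    return phi


def toy_risk(x):
    """Hypothetical risk score (not the fitted model): additive terms plus an interaction."""
    return (0.10 + 0.20 * x["prior_tb"] + 0.10 * x["drug_use"]
            + 0.10 * x["prior_tb"] * x["drug_use"])


patient = {"prior_tb": 1, "drug_use": 1, "hiv": 0}
reference = {"prior_tb": 0, "drug_use": 0, "hiv": 0}
phi = shapley_values(toy_risk, patient, reference)
```

In this toy case the contributions are approximately 0.25 for `prior_tb`, 0.15 for `drug_use`, and 0 for `hiv`: the interaction term is shared equally between the two interacting features, and the contributions sum to the difference between the patient's score (0.5) and the reference score (0.1). Exact enumeration scales exponentially with the number of features, which is why tree-specific algorithms are used in practice.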

Discussion
In this study of pulmonary TB cases reported to SINAN in Brazil, we developed a risk score that effectively stratified, before treatment initiation, those TB cases at higher risk of LTFU during ATT. Our score used data from 8 features, all of which came from the case notification form and were publicly available. Those features included clinical and epidemiologic information that can be collected by health professionals before treatment initiation and that predicted LTFU independently of other characteristics. The use of this risk score could provide crucial information to target specific patients from the time of diagnosis and improve successful ATT completion, potentially facilitating the achievement of the WHO target of 90% of patients with treatment success [18]. Importantly, in our study, 14.5% of the total population experienced LTFU, which represents an important public health problem because of the risk of M. tuberculosis transmission and the potential generation of drug-resistant strains [19]. In addition, the rates of DOT in the group that experienced LTFU were significantly lower than in the cure group. This reinforces the importance of detecting these patients at the beginning of TB treatment, which might help clinicians prioritize DOT and help the Brazilian national TB program define target populations.
Table note: Other comorbidities include cancer, kidney disease, chronic obstructive pulmonary disease, emphysema, allergies, and asthma.
Our probabilistic score was developed using clinical and sociodemographic data readily collected in most clinical care settings, even resource-limited ones. Among the variables selected, prior TB, consumption habits (alcohol, tobacco, or drug use), age (adult and elderly), biological sex, HIV infection, and schooling level were the risk factors that contributed most to LTFU during TB treatment. Some of these characteristics have been explored and linked to unfavorable TB treatment outcomes through their relationship with poor therapy adherence, LTFU, and treatment discontinuation [20–27]. It is important to highlight that our study identified a history of prior TB as the variable with the greatest impact on the model's ability to predict LTFU. This finding is consistent with extensive literature, which attributes this impact to a mix of psychological factors, barriers to healthcare access, social conditions, and stigma [28–31]. Additionally, a study using the SINAN database highlighted that a history of previous treatment abandonment is the primary risk factor for LTFU in new treatment cycles, underlining the importance of past treatment adherence in predicting and managing future outcomes [32].
In a previous study, a similar score was developed to predict unfavorable anti-TB treatment outcomes in people living with diabetes in China, although it used clinical and radiologic data [23]. Another study, from Mexico, developed an algorithm to predict mortality, failure, and drug resistance in newly diagnosed TB patients using clinical features and laboratory tests [27]. In contrast, our score can be applied to patients with or without diabetes and uses only clinical information, without the need for laboratory data or radiographic exams.
While exploring data from the RePORT-Brazil consortium, we previously reported a clinical prediction model for unfavorable pulmonary TB treatment outcomes [20]. That score used information that was not readily available in SINAN, and we therefore found it difficult to translate to the nationwide TB program in Brazil. The present study intended to create a score that could be ...
Our risk model had several limitations. First, the study used nationwide public data, in which several features had missing data and were subject to a wide range of demographic and regional discrepancies. Second, most comorbidities and clinical characteristics were self-reported, which may introduce misclassification bias. Third, the study included only pulmonary TB cases and consequently may not apply to extrapulmonary or disseminated TB. Finally, we excluded vulnerable populations, and exclusions totaled more than 50% of all reported cases, which limits use of the score to populations similar to those included in our study. We suggest that future scores include more clinical data, physical examination findings, and socioeconomic conditions to improve accuracy and extend applicability in clinical practice.
Despite these limitations, to the best of our knowledge, this is the first prognostic score model developed in South America using only clinical and epidemiologic data from disease notification forms, obtained before therapy initiation, with relatively accurate prediction. The resulting model is parsimonious and should be utilized by clinicians through a nomogram or web application (https://tbprediction.onrender.com), assisting in TB care and potentially improving the successful completion of ATT among pulmonary TB patients.
Definition of age: children/teenage (0, 18], young adult (18, 35], adult (35, 50], senior adult (50, 65], and elderly (> 80 years old). Definition of alcohol use: past or current consumption of any alcohol. Definition of smoking: past or current smoking of tobacco. Definition of non-white race: combination of black, mixed, pardo, yellow, and indigenous. Definition of drug use: past or current drug use (marijuana, cocaine, heroin, or crack).

Fig. 3 Receiver operating characteristic (ROC) curve for prediction of LTFU based on data available in SINAN using three different machine learning algorithms

Table 1
Characteristics of the overall population of the study

Table note:
Data represent no. (%), except for age, which is presented as median and interquartile range (IQR).